Abstract
This study presents a comprehensive Bayesian hierarchical analysis of
survival patterns among Titanic passengers. Using Markov Chain Monte
Carlo (MCMC) methods implemented through JAGS, we investigate the
probabilistic relationships between passenger characteristics and
survival outcomes. The analysis employs a hierarchical logistic
regression model with random effects for embarkation ports,
incorporating weakly informative priors following current Bayesian best
practices. Our findings quantify the dramatic effects of gender, social
class, age, and family structure on survival probability, providing
probabilistic estimates with full uncertainty quantification that
classical frequentist approaches cannot deliver.
Introduction and
Motivation
The sinking of RMS Titanic on April 14-15, 1912, represents one of
the most extensively documented maritime disasters in modern history.
Beyond its historical significance, this tragic event provides an
exceptional opportunity for statistical analysis, offering a rich
dataset that allows us to investigate the factors determining human
survival under extreme conditions.
Rationale for
Bayesian Methodology
The Bayesian approach offers several methodological advantages over
classical frequentist methods for this analysis:
Natural Uncertainty Quantification: Bayesian
inference provides complete posterior distributions for all parameters,
enabling direct probability statements about parameter values.
Hierarchical Structure: The natural grouping of
passengers by embarkation port creates a hierarchical data structure
that Bayesian methods handle elegantly through random effects.
Prior Information Integration: Bayesian methods
allow incorporation of historical knowledge about human behavior in
emergency situations through informative prior distributions.
Model Comparison: Bayesian model selection
criteria (DIC, WAIC, LOO) provide principled approaches to compare
competing hypotheses.
Research
Objectives
This investigation aims to:
- Quantify the probabilistic impact of demographic and socioeconomic
factors on survival
- Develop a predictive model capable of generalizing to similar
emergency scenarios
- Demonstrate methodological superiority of Bayesian approaches
through rigorous validation
- Provide uncertainty quantification for all inferences through
complete posterior distributions
Dataset Description and
Exploratory Analysis
Data Structure and
Characteristics
# Load and examine the Titanic dataset
titanic_raw <- read_csv('/Users/roberto/Desktop/titanic_project/train (2).csv',
show_col_types = FALSE)
# Display dataset dimensions and structure
cat("Dataset Dimensions:", nrow(titanic_raw), "observations,", ncol(titanic_raw), "variables\n")
Dataset Dimensions: 891 observations, 12 variables
glimpse(titanic_raw)
Rows: 891
Columns: 12
$ PassengerId <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,…
$ Survived <dbl> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1…
$ Pclass <dbl> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2, 3, 3…
$ Name <chr> "Braund, Mr. Owen Harris", "Cumings, Mrs. John Bradley (Fl…
$ Sex <chr> "male", "female", "female", "female", "male", "male", "mal…
$ Age <dbl> 22, 38, 26, 35, 35, NA, 54, 2, 27, 14, 4, 58, 20, 39, 14, …
$ SibSp <dbl> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0, 1, 0…
$ Parch <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0, 0…
$ Ticket <chr> "A/5 21171", "PC 17599", "STON/O2. 3101282", "113803", "37…
$ Fare <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.8625,…
$ Cabin <chr> NA, "C85", NA, "C123", NA, NA, "E46", NA, NA, NA, "G6", "C…
$ Embarked <chr> "S", "C", "S", "S", "S", "Q", "S", "S", "S", "C", "S", "S"…
The dataset contains 891 passenger records with 12 variables,
representing a subset of the complete passenger manifest. Key variables
include survival status (binary), passenger class (ordinal), demographic
characteristics (age, sex), family composition (siblings/spouses,
parents/children), fare paid, and embarkation port.
Missing Data
Analysis
# Comprehensive missing data analysis
missing_analysis <- titanic_raw %>%
summarise_all(~sum(is.na(.))) %>%
gather(Variable, Missing_Count) %>%
mutate(Missing_Percentage = round(Missing_Count / nrow(titanic_raw) * 100, 2)) %>%
arrange(desc(Missing_Count))
missing_analysis %>%
kable(caption = "Missing Data Patterns",
col.names = c("Variable", "Missing Count", "Missing %"),
booktabs = TRUE) %>%
kable_styling(latex_options = c("striped", "hold_position"))
Missing Data Patterns
|
Variable
|
Missing Count
|
Missing %
|
|
Cabin
|
687
|
77.10
|
|
Age
|
177
|
19.87
|
|
Embarked
|
2
|
0.22
|
|
PassengerId
|
0
|
0.00
|
|
Survived
|
0
|
0.00
|
|
Pclass
|
0
|
0.00
|
|
Name
|
0
|
0.00
|
|
Sex
|
0
|
0.00
|
|
SibSp
|
0
|
0.00
|
|
Parch
|
0
|
0.00
|
|
Ticket
|
0
|
0.00
|
|
Fare
|
0
|
0.00
|
# Visualize missing data patterns
aggr(titanic_raw, col = c('navyblue', 'red'),
numbers = TRUE, sortVars = TRUE,
main = "Missing Data Pattern Analysis")

Variables sorted by number of missings:
Variable Count
Cabin 0.771043771
Age 0.198653199
Embarked 0.002244669
PassengerId 0.000000000
Survived 0.000000000
Pclass 0.000000000
Name 0.000000000
Sex 0.000000000
SibSp 0.000000000
Parch 0.000000000
Ticket 0.000000000
Fare 0.000000000
# Test MCAR assumption
mcar_result <- mcar_test(titanic_raw)
cat("\nMCAR Test Results:\n")
MCAR Test Results:
cat("Test Statistic:", round(mcar_result$statistic, 3), "\n")
Test Statistic: 598.542
cat("p-value:", round(mcar_result$p.value, 3), "\n")
p-value: 0
cat("Conclusion:", ifelse(mcar_result$p.value > 0.05,
"Consistent with MCAR",
"Evidence against MCAR"), "\n")
Conclusion: Evidence against MCAR
The missing data pattern reveals three primary areas of concern:
Cabin Information (77.1% missing): The high
missingness rate for cabin data reflects the social stratification of
1912, where lower-class passengers often lacked assigned cabin
numbers.
Age Data (19.9% missing): Age information was
frequently unrecorded or inaccurate in historical passenger
manifests.
Embarkation Port (0.2% missing): Nearly complete
information, reflecting the importance of port documentation for
maritime records.
The MCAR test provides evidence about the nature of missingness
patterns, informing our imputation strategy.
Enhanced Data
Preprocessing with Multiple Imputation
# Multiple imputation for age variable
set.seed(42)
mice_imputation <- mice(titanic_raw %>% select(Age, Sex, Pclass, SibSp, Parch, Fare),
m = 5, method = 'pmm', printFlag = FALSE)
# Extract completed datasets and use the first one for main analysis
titanic_completed <- complete(mice_imputation, 1)
# Comprehensive data preprocessing with theoretical justification
titanic <- titanic_raw %>%
mutate(
# Use multiply imputed age
Age = titanic_completed$Age,
# Convert survival to factor for clarity
Survived = factor(Survived, levels = c(0, 1), labels = c("No", "Yes")),
# Convert passenger class to ordered factor (3rd class as reference)
Pclass = factor(Pclass, levels = c("3", "2", "1"), ordered = TRUE),
# Convert sex to factor
Sex = factor(Sex),
# Impute embarkation port with mode (Southampton)
Embarked = fct_na_value_to_level(factor(Embarked), level = "S"),
# Feature engineering: Family size as predictor of survival
FamilySize = SibSp + Parch + 1,
# Categorical family size for interpretation
FamilyCategory = case_when(
FamilySize == 1 ~ "Alone",
FamilySize %in% 2:4 ~ "Small",
FamilySize >= 5 ~ "Large"
)
)
# Compare imputation methods
age_comparison <- data.frame(
Method = c("Original (complete cases)", "Median Imputation", "Multiple Imputation"),
Mean = c(mean(titanic_raw$Age, na.rm=T),
median(titanic_raw$Age, na.rm=T),
mean(titanic$Age)),
SD = c(sd(titanic_raw$Age, na.rm=T),
sd(titanic_raw$Age, na.rm=T),
sd(titanic$Age))
)
age_comparison %>%
kable(digits = 2, caption = "Comparison of Age Imputation Methods",
booktabs = TRUE) %>%
kable_styling(latex_options = c("striped", "hold_position"))
Comparison of Age Imputation Methods
|
Method
|
Mean
|
SD
|
|
Original (complete cases)
|
29.70
|
14.53
|
|
Median Imputation
|
28.00
|
14.53
|
|
Multiple Imputation
|
29.94
|
13.99
|
Exploratory Data
Analysis
Survival Patterns
by Key Predictors
# Calculate key survival statistics
survival_stats <- list(
Overall = mean(titanic$Survived == "Yes") * 100,
Female = mean(titanic$Survived[titanic$Sex == "female"] == "Yes") * 100,
Male = mean(titanic$Survived[titanic$Sex == "male"] == "Yes") * 100,
First_Class = mean(titanic$Survived[titanic$Pclass == "1"] == "Yes") * 100,
Second_Class = mean(titanic$Survived[titanic$Pclass == "2"] == "Yes") * 100,
Third_Class = mean(titanic$Survived[titanic$Pclass == "3"] == "Yes") * 100
)
# Create summary table
survival_summary <- data.frame(
Category = c("Overall", "Female", "Male", "First Class", "Second Class", "Third Class"),
Survival_Rate = round(unlist(survival_stats), 1)
)
survival_summary %>%
kable(caption = "Survival Rates by Passenger Category",
col.names = c("Category", "Survival Rate (%)"),
booktabs = TRUE) %>%
kable_styling(latex_options = c("striped", "hold_position"))
Survival Rates by Passenger Category
|
|
Category
|
Survival Rate (%)
|
|
Overall
|
Overall
|
38.4
|
|
Female
|
Female
|
74.2
|
|
Male
|
Male
|
18.9
|
|
First_Class
|
First Class
|
63.0
|
|
Second_Class
|
Second Class
|
47.3
|
|
Third_Class
|
Third Class
|
24.2
|
Visualizing Key
Relationships
# Age distribution comparison
p1 <- ggplot(titanic, aes(x = Age)) +
geom_histogram(binwidth = 5, fill = custom_colors[1], color = "white", alpha = 0.8) +
geom_vline(xintercept = median(titanic$Age), linetype = "dashed", color = "red") +
labs(title = "Age Distribution (Multiple Imputation)",
x = "Age (years)", y = "Frequency") +
theme_minimal()
# Family size effect
p2 <- ggplot(titanic, aes(x = factor(FamilySize), fill = Survived)) +
geom_bar(position = "fill", alpha = 0.8) +
scale_y_continuous(labels = percent_format()) +
scale_fill_manual(values = custom_colors[c(4, 3)]) +
labs(title = "Survival Rate by Family Size",
x = "Family Size", y = "Proportion") +
theme_minimal()
# Survival by class and sex
p3 <- ggplot(titanic, aes(x = Pclass, fill = Survived)) +
geom_bar(position = "fill", alpha = 0.8) +
facet_wrap(~Sex) +
scale_y_continuous(labels = percent_format()) +
scale_fill_manual(values = custom_colors[c(4, 3)]) +
labs(title = "Survival Rate by Class and Sex",
x = "Passenger Class", y = "Proportion") +
theme_minimal()
# Age vs survival
p4 <- ggplot(titanic, aes(x = Age, fill = Survived)) +
geom_density(alpha = 0.6) +
scale_fill_manual(values = custom_colors[c(4, 3)]) +
labs(title = "Age Distribution by Survival",
x = "Age (years)", y = "Density") +
theme_minimal()
print(p1)

print(p2)

print(p3)

print(p4)

Statistical Model
Specification
Theoretical
Foundation
Model Choice
Justification
We employ a hierarchical logistic regression model for several
theoretical reasons:
Binary Outcome: Survival is inherently binary,
making the Bernoulli distribution the natural choice.
Logit Link Function: The logit transformation
maps probabilities to the real line, enabling linear modeling of the
systematic component.
Hierarchical Structure: Passengers are naturally
grouped by embarkation port, creating correlation within groups that
violates independence assumptions of standard logistic
regression.
Random Effects: Port-specific random intercepts
capture unobserved heterogeneity (cultural differences, boarding
procedures, social composition).
Prior Sensitivity
Analysis
# Define different prior scales for sensitivity analysis
prior_scales <- c(0.001, 0.01, 0.1) # Different precisions in JAGS
prior_labels <- c("Strong (τ=0.001)", "Weak (τ=0.01)", "Very Weak (τ=0.1)")
cat("Prior Sensitivity Analysis:\n")
Prior Sensitivity Analysis:
cat("Testing regression coefficient priors with different precisions:\n")
Testing regression coefficient priors with different precisions:
for(i in 1:length(prior_scales)) {
sd_equiv <- sqrt(1/prior_scales[i])
cat(sprintf("- %s: N(0, %.3f) ⟺ SD = %.1f\n",
prior_labels[i], prior_scales[i], sd_equiv))
}
- Strong (τ=0.001): N(0, 0.001) ⟺ SD = 31.6
- Weak (τ=0.01): N(0, 0.010) ⟺ SD = 10.0
- Very Weak (τ=0.1): N(0, 0.100) ⟺ SD = 3.2
Data Preparation for
MCMC
# Prepare data list for JAGS with optimal transformations
data_list <- list(
N = nrow(titanic),
y = as.numeric(titanic$Survived == "Yes"),
age_std = as.numeric(scale(titanic$Age)),
fare_std = as.numeric(scale(titanic$Fare)),
family_size_std = as.numeric(scale(titanic$FamilySize)),
family_size_sq_std = as.numeric(scale(titanic$FamilySize^2)),
female = as.integer(titanic$Sex == "female"),
class_1 = as.integer(titanic$Pclass == "1"),
class_2 = as.integer(titanic$Pclass == "2"),
embarked_idx = as.integer(titanic$Embarked),
n_embarked = nlevels(titanic$Embarked)
)
# Display data structure
str(data_list)
List of 11
$ N : int 891
$ y : num [1:891] 0 1 1 1 0 0 0 0 1 1 ...
$ age_std : num [1:891] -0.567 0.577 -0.281 0.362 0.362 ...
$ fare_std : num [1:891] -0.502 0.786 -0.489 0.42 -0.486 ...
$ family_size_std : num [1:891] 0.0591 0.0591 -0.5607 0.0591 -0.5607 ...
$ family_size_sq_std: num [1:891] -0.158 -0.158 -0.37 -0.158 -0.37 ...
$ female : int [1:891] 0 1 1 1 0 0 0 0 1 1 ...
$ class_1 : int [1:891] 0 1 0 1 0 0 1 0 0 0 ...
$ class_2 : int [1:891] 0 0 0 0 0 0 0 0 0 1 ...
$ embarked_idx : int [1:891] 3 1 3 3 3 2 3 3 3 1 ...
$ n_embarked : int 3
# Quality checks
cat("Quality Checks:\n")
Quality Checks:
cat("Standardized age range:", round(range(data_list$age_std), 2), "\n")
Standardized age range: -2.11 3.58
cat("Standardized fare range:", round(range(data_list$fare_std), 2), "\n")
Standardized fare range: -0.65 9.66
cat("Survival rate:", round(mean(data_list$y) * 100, 1), "%\n")
Survival rate: 38.4 %
cat("Embarkation port distribution:\n")
Embarkation port distribution:
print(table(data_list$embarked_idx))
1 2 3
168 77 646
MCMC Implementation and
Model Fitting
Base Model
Implementation
# Define hierarchical logistic regression model in JAGS syntax
model_base_string <- "
model {
# Likelihood specification
for(i in 1:N) {
y[i] ~ dbern(p[i])
logit(p[i]) <- beta0 +
beta_female * female[i] +
beta_age * age_std[i] +
beta_fare * fare_std[i] +
beta_family * family_size_std[i] +
beta_family_sq * family_size_sq_std[i] +
beta_class1 * class_1[i] +
beta_class2 * class_2[i] +
u_embarked[embarked_idx[i]]
}
# Prior specifications
beta0 ~ dnorm(0, 0.01) # precision = 0.01 => variance = 100
beta_female ~ dnorm(0, 0.01)
beta_age ~ dnorm(0, 0.01)
beta_fare ~ dnorm(0, 0.01)
beta_family ~ dnorm(0, 0.01)
beta_family_sq ~ dnorm(0, 0.01)
beta_class1 ~ dnorm(0, 0.01)
beta_class2 ~ dnorm(0, 0.01)
# Random effects for embarkation ports
for(j in 1:n_embarked) {
u_embarked[j] ~ dnorm(0, tau_u)
}
# Half-Cauchy prior for random effect standard deviation
tau_u <- pow(sigma_u, -2)
sigma_u ~ dt(0, pow(5, -2), 1) T(0,)
}
"
# Clear any existing models
if(exists("mod_base")) rm(mod_base)
if(exists("samples_base")) rm(samples_base)
# Compile and initialize model
set.seed(42)
mod_base <- jags.model(
textConnection(model_base_string),
data = data_list,
n.chains = 4,
n.adapt = 5000,
quiet = TRUE
)
# Burn-in period
update(mod_base, n.iter = 15000)
# MCMC sampling
samples_base <- coda.samples(
mod_base,
variable.names = c("beta0", "beta_female", "beta_age", "beta_fare",
"beta_family", "beta_family_sq", "beta_class1",
"beta_class2", "sigma_u", "u_embarked"),
n.iter = 30000,
thin = 20
)
cat("MCMC sampling completed: 4 chains × 30,000 iterations (thinned by 20)\n")
MCMC sampling completed: 4 chains × 30,000 iterations (thinned by 20)
cat("Total posterior samples: 6,000\n")
Total posterior samples: 6,000
Alternative Model
with Interactions
# Extended model with gender × class interactions
model_alt_string <- "
model {
for(i in 1:N) {
y[i] ~ dbern(p[i])
logit(p[i]) <- beta0 +
beta_female * female[i] +
beta_age * age_std[i] +
beta_fare * fare_std[i] +
beta_family * family_size_std[i] +
beta_family_sq * family_size_sq_std[i] +
beta_class1 * class_1[i] +
beta_class2 * class_2[i] +
beta_int_fem_c1 * female[i] * class_1[i] +
beta_int_fem_c2 * female[i] * class_2[i] +
u_embarked[embarked_idx[i]]
}
# Same priors as base model plus interaction terms
beta0 ~ dnorm(0, 0.01); beta_female ~ dnorm(0, 0.01); beta_age ~ dnorm(0, 0.01)
beta_fare ~ dnorm(0, 0.01); beta_family ~ dnorm(0, 0.01); beta_family_sq ~ dnorm(0, 0.01)
beta_class1 ~ dnorm(0, 0.01); beta_class2 ~ dnorm(0, 0.01)
beta_int_fem_c1 ~ dnorm(0, 0.01); beta_int_fem_c2 ~ dnorm(0, 0.01)
for(j in 1:n_embarked) { u_embarked[j] ~ dnorm(0, tau_u) }
tau_u <- pow(sigma_u, -2)
sigma_u ~ dt(0, pow(5, -2), 1) T(0,)
}
"
# Clear existing alternative model
if(exists("mod_alt")) rm(mod_alt)
if(exists("samples_alt")) rm(samples_alt)
# Compile alternative model
mod_alt <- jags.model(
textConnection(model_alt_string),
data = data_list,
n.chains = 4,
n.adapt = 5000,
quiet = TRUE
)
update(mod_alt, n.iter = 15000)
samples_alt <- coda.samples(
mod_alt,
variable.names = c("beta0", "beta_female", "beta_age", "beta_fare",
"beta_family", "beta_family_sq", "beta_class1",
"beta_class2", "beta_int_fem_c1", "beta_int_fem_c2",
"sigma_u", "u_embarked"),
n.iter = 30000,
thin = 20
)
Prior Sensitivity
Analysis Implementation
# Fit model with different prior scales for sensitivity analysis
fit_sensitivity_model <- function(prior_precision, label) {
model_string <- paste0("
model {
for(i in 1:N) {
y[i] ~ dbern(p[i])
logit(p[i]) <- beta0 +
beta_female * female[i] +
beta_age * age_std[i] +
beta_fare * fare_std[i] +
beta_family * family_size_std[i] +
beta_family_sq * family_size_sq_std[i] +
beta_class1 * class_1[i] +
beta_class2 * class_2[i] +
u_embarked[embarked_idx[i]]
}
# Priors with specified precision
beta0 ~ dnorm(0, ", prior_precision, ")
beta_female ~ dnorm(0, ", prior_precision, ")
beta_age ~ dnorm(0, ", prior_precision, ")
beta_fare ~ dnorm(0, ", prior_precision, ")
beta_family ~ dnorm(0, ", prior_precision, ")
beta_family_sq ~ dnorm(0, ", prior_precision, ")
beta_class1 ~ dnorm(0, ", prior_precision, ")
beta_class2 ~ dnorm(0, ", prior_precision, ")
for(j in 1:n_embarked) { u_embarked[j] ~ dnorm(0, tau_u) }
tau_u <- pow(sigma_u, -2)
sigma_u ~ dt(0, pow(5, -2), 1) T(0,)
}")
mod <- jags.model(textConnection(model_string), data = data_list,
n.chains = 2, n.adapt = 2000, quiet = TRUE)
update(mod, n.iter = 5000)
samples <- coda.samples(mod,
variable.names = c("beta_female", "beta_class1"),
n.iter = 10000, thin = 10)
return(samples)
}
# Run sensitivity analysis for key parameters
sensitivity_results <- list()
for(i in 1:length(prior_scales)) {
sensitivity_results[[i]] <- fit_sensitivity_model(prior_scales[i], prior_labels[i])
}
# Compare results
sensitivity_summary <- data.frame(
Prior = prior_labels,
Beta_Female_Mean = sapply(sensitivity_results, function(x) mean(as.matrix(x)[,"beta_female"])),
Beta_Female_SD = sapply(sensitivity_results, function(x) sd(as.matrix(x)[,"beta_female"])),
Beta_Class1_Mean = sapply(sensitivity_results, function(x) mean(as.matrix(x)[,"beta_class1"])),
Beta_Class1_SD = sapply(sensitivity_results, function(x) sd(as.matrix(x)[,"beta_class1"]))
)
sensitivity_summary %>%
kable(caption = "Prior Sensitivity Analysis Results", digits = 3,
booktabs = TRUE) %>%
kable_styling(latex_options = c("striped", "hold_position"))
Prior Sensitivity Analysis Results
|
Prior
|
Beta_Female_Mean
|
Beta_Female_SD
|
Beta_Class1_Mean
|
Beta_Class1_SD
|
|
Strong (τ=0.001)
|
2.691
|
0.197
|
2.054
|
0.301
|
|
Weak (τ=0.01)
|
2.684
|
0.205
|
2.058
|
0.303
|
|
Very Weak (τ=0.1)
|
2.671
|
0.198
|
2.036
|
0.299
|
Results and Posterior
Inference
Posterior Summary
Statistics
# Extract posterior summaries for main parameters
main_params <- c("beta0", "beta_female", "beta_age", "beta_fare",
"beta_family", "beta_family_sq", "beta_class1",
"beta_class2", "sigma_u")
posterior_summary <- summary(samples_base[, main_params])
# Create formatted results table
posterior_table <- data.frame(
Parameter = rownames(posterior_summary$statistics),
Mean = round(posterior_summary$statistics[, "Mean"], 3),
SD = round(posterior_summary$statistics[, "SD"], 3),
CI_2.5 = round(posterior_summary$quantiles[, "2.5%"], 3),
CI_97.5 = round(posterior_summary$quantiles[, "97.5%"], 3)
) %>%
mutate(
Significant = ifelse(sign(CI_2.5) == sign(CI_97.5), "Yes", "No")
)
posterior_table %>%
kable(caption = "Posterior Summary Statistics - Base Model",
col.names = c("Parameter", "Mean", "SD", "2.5%", "97.5%", "Significant"),
booktabs = TRUE) %>%
kable_styling(latex_options = c("striped", "hold_position"))
Posterior Summary Statistics - Base Model
|
|
Parameter
|
Mean
|
SD
|
2.5%
|
97.5%
|
Significant
|
|
beta0
|
beta0
|
-2.245
|
0.662
|
-3.390
|
-0.731
|
Yes
|
|
beta_female
|
beta_female
|
2.689
|
0.202
|
2.296
|
3.087
|
Yes
|
|
beta_age
|
beta_age
|
-0.471
|
0.108
|
-0.683
|
-0.265
|
Yes
|
|
beta_fare
|
beta_fare
|
0.138
|
0.126
|
-0.095
|
0.402
|
No
|
|
beta_family
|
beta_family
|
0.829
|
0.407
|
0.064
|
1.656
|
Yes
|
|
beta_family_sq
|
beta_family_sq
|
-1.635
|
0.547
|
-2.761
|
-0.647
|
Yes
|
|
beta_class1
|
beta_class1
|
2.051
|
0.303
|
1.457
|
2.659
|
Yes
|
|
beta_class2
|
beta_class2
|
1.074
|
0.241
|
0.606
|
1.549
|
Yes
|
|
sigma_u
|
sigma_u
|
0.710
|
1.014
|
0.021
|
3.655
|
Yes
|
Odds Ratio
Interpretation
# Calculate odds ratios and credible intervals
posterior_means <- colMeans(as.matrix(samples_base[, main_params]))
odds_ratios <- exp(posterior_means[grep("beta", names(posterior_means))])
# Odds ratio credible intervals
or_ci_lower <- exp(posterior_table$CI_2.5[1:8])
or_ci_upper <- exp(posterior_table$CI_97.5[1:8])
# Create interpretation table
interpretation_data <- data.frame(
Variable = c("Intercept", "Female vs Male", "Age (per SD)",
"Fare (per SD)", "Family Size (per SD)",
"Family Size² (per SD)", "1st vs 3rd Class", "2nd vs 3rd Class"),
Odds_Ratio = round(odds_ratios, 2),
CI_Lower = round(or_ci_lower, 2),
CI_Upper = round(or_ci_upper, 2)
)
interpretation_data %>%
kable(caption = "Odds Ratios with 95% Credible Intervals",
col.names = c("Variable", "Odds Ratio", "CI Lower", "CI Upper"),
booktabs = TRUE) %>%
kable_styling(latex_options = c("striped", "hold_position"))
Odds Ratios with 95% Credible Intervals
|
|
Variable
|
Odds Ratio
|
CI Lower
|
CI Upper
|
|
beta0
|
Intercept
|
0.11
|
0.03
|
0.48
|
|
beta_female
|
Female vs Male
|
14.71
|
9.93
|
21.91
|
|
beta_age
|
Age (per SD)
|
0.62
|
0.51
|
0.77
|
|
beta_fare
|
Fare (per SD)
|
1.15
|
0.91
|
1.49
|
|
beta_family
|
Family Size (per SD)
|
2.29
|
1.07
|
5.24
|
|
beta_family_sq
|
Family Size² (per SD)
|
0.20
|
0.06
|
0.52
|
|
beta_class1
|
1st vs 3rd Class
|
7.77
|
4.29
|
14.28
|
|
beta_class2
|
2nd vs 3rd Class
|
2.93
|
1.83
|
4.71
|
# Key findings summary
cat("Key Findings:\n")
Key Findings:
cat("Female survival odds:", round((odds_ratios[2] - 1) * 100, 0), "% higher than males\n")
Female survival odds: 1371 % higher than males
cat("1st class survival odds:", round((odds_ratios[7] - 1) * 100, 0), "% higher than 3rd class\n")
1st class survival odds: 677 % higher than 3rd class
cat("2nd class survival odds:", round((odds_ratios[8] - 1) * 100, 0), "% higher than 3rd class\n")
2nd class survival odds: 193 % higher than 3rd class
Interaction Effects
Analysis
# Extract interaction parameters from alternative model
interaction_summary <- summary(samples_alt[, c("beta_int_fem_c1", "beta_int_fem_c2")])
interaction_table <- data.frame(
Interaction = c("Female × 1st Class", "Female × 2nd Class"),
Mean = round(interaction_summary$statistics[, "Mean"], 3),
SD = round(interaction_summary$statistics[, "SD"], 3),
CI_2.5 = round(interaction_summary$quantiles[, "2.5%"], 3),
CI_97.5 = round(interaction_summary$quantiles[, "97.5%"], 3)
)
interaction_table %>%
kable(caption = "Gender × Class Interaction Effects",
col.names = c("Interaction", "Mean", "SD", "2.5%", "97.5%"),
booktabs = TRUE) %>%
kable_styling(latex_options = c("striped", "hold_position"))
Gender × Class Interaction Effects
|
|
Interaction
|
Mean
|
SD
|
2.5%
|
97.5%
|
|
beta_int_fem_c1
|
Female × 1st Class
|
2.142
|
0.717
|
0.863
|
3.657
|
|
beta_int_fem_c2
|
Female × 2nd Class
|
2.551
|
0.579
|
1.472
|
3.765
|
# Calculate survival probabilities for different groups
calculate_survival_probs <- function(samples, female, class1, class2) {
samples_mat <- as.matrix(samples)
# Use mean values for other variables
age_std_mean <- 0 # standardized
fare_std_mean <- 0
family_size_std_mean <- 0
family_size_sq_std_mean <- 0
u_embarked_mean <- 0 # assume Southampton (most common)
eta <- samples_mat[,"beta0"] +
samples_mat[,"beta_female"] * female +
samples_mat[,"beta_age"] * age_std_mean +
samples_mat[,"beta_fare"] * fare_std_mean +
samples_mat[,"beta_family"] * family_size_std_mean +
samples_mat[,"beta_family_sq"] * family_size_sq_std_mean +
samples_mat[,"beta_class1"] * class1 +
samples_mat[,"beta_class2"] * class2 +
samples_mat[,"beta_int_fem_c1"] * female * class1 +
samples_mat[,"beta_int_fem_c2"] * female * class2 +
u_embarked_mean
return(plogis(eta))
}
# Calculate survival probabilities for all gender-class combinations
survival_probs <- data.frame(
Group = c("Male 3rd Class", "Male 2nd Class", "Male 1st Class",
"Female 3rd Class", "Female 2nd Class", "Female 1st Class"),
Mean_Probability = c(
mean(calculate_survival_probs(samples_alt, 0, 0, 0)),
mean(calculate_survival_probs(samples_alt, 0, 0, 1)),
mean(calculate_survival_probs(samples_alt, 0, 1, 0)),
mean(calculate_survival_probs(samples_alt, 1, 0, 0)),
mean(calculate_survival_probs(samples_alt, 1, 0, 1)),
mean(calculate_survival_probs(samples_alt, 1, 1, 0))
)
)
survival_probs %>%
mutate(Mean_Probability = round(Mean_Probability, 3)) %>%
kable(caption = "Predicted Survival Probabilities by Group",
col.names = c("Group", "Survival Probability"),
booktabs = TRUE) %>%
kable_styling(latex_options = c("striped", "hold_position"))
Predicted Survival Probabilities by Group
|
Group
|
Survival Probability
|
|
Male 3rd Class
|
0.149
|
|
Male 2nd Class
|
0.174
|
|
Male 1st Class
|
0.441
|
|
Female 3rd Class
|
0.487
|
|
Female 2nd Class
|
0.909
|
|
Female 1st Class
|
0.954
|
MCMC Convergence
Diagnostics
# Calculate convergence diagnostics
rhat_values <- gelman.diag(samples_base[, main_params], multivariate = FALSE)$psrf
ess_values <- effectiveSize(samples_base[, main_params])
# Create diagnostics table
convergence_diagnostics <- data.frame(
Parameter = names(ess_values),
R_hat = round(rhat_values[, 1], 4),
ESS = round(ess_values, 0),
ESS_per_chain = round(ess_values / 4, 0)
)
convergence_diagnostics %>%
kable(caption = "MCMC Convergence Diagnostics",
col.names = c("Parameter", "R̂", "ESS", "ESS per Chain"),
booktabs = TRUE) %>%
kable_styling(latex_options = c("striped", "hold_position"))
MCMC Convergence Diagnostics
|
|
Parameter
|
R
|
ES
|
|
beta0
|
beta0
|
1.0613
|
325
|
81
|
|
beta_female
|
beta_female
|
0.9998
|
5773
|
1443
|
|
beta_age
|
beta_age
|
1.0002
|
5838
|
1460
|
|
beta_fare
|
beta_fare
|
1.0010
|
6000
|
1500
|
|
beta_family
|
beta_family
|
1.0017
|
2799
|
700
|
|
beta_family_sq
|
beta_family_sq
|
1.0015
|
2785
|
696
|
|
beta_class1
|
beta_class1
|
1.0007
|
6074
|
1518
|
|
beta_class2
|
beta_class2
|
1.0001
|
5645
|
1411
|
|
sigma_u
|
sigma_u
|
1.0175
|
690
|
173
|
# Trace plots for key parameters
color_scheme_set("viridis")
mcmc_trace(samples_base, pars = main_params[1:4]) +
ggtitle("MCMC Trace Plots") +
theme(legend.position = "none")

# Density plots
mcmc_dens(samples_base, pars = main_params[1:4]) +
ggtitle("Posterior Density Plots")

Enhanced Model
Comparison and Selection
# Calculate DIC for both models
dic_base <- dic.samples(mod_base, n.iter = 20000, type = "pD")
dic_alt <- dic.samples(mod_alt, n.iter = 20000, type = "pD")
# Extract DIC components
dic_base_value <- sum(dic_base$deviance) + sum(dic_base$penalty)
dic_alt_value <- sum(dic_alt$deviance) + sum(dic_alt$penalty)
delta_dic <- dic_alt_value - dic_base_value
# Calculate Bayes factors (approximate)
# Using DIC as approximation: BF ≈ exp(-0.5 * ΔDIC)
bayes_factor <- exp(-0.5 * delta_dic)
# Model comparison table
model_comparison <- data.frame(
Model = c("Base Model", "Model with Interactions"),
Deviance = c(sum(dic_base$deviance), sum(dic_alt$deviance)),
pD = c(sum(dic_base$penalty), sum(dic_alt$penalty)),
DIC = c(dic_base_value, dic_alt_value),
Delta_DIC = c(0, delta_dic),
Bayes_Factor = c(1, bayes_factor)
)
model_comparison %>%
kable(caption = "Enhanced Model Comparison",
digits = 2,
booktabs = TRUE) %>%
kable_styling(latex_options = c("striped", "hold_position"))
Enhanced Model Comparison
|
Model
|
Deviance
|
pD
|
DIC
|
Delta_DIC
|
Bayes_Factor
|
|
Base Model
|
787.99
|
9.55
|
797.54
|
0.00
|
1.0
|
|
Model with Interactions
|
761.71
|
11.75
|
773.46
|
-24.08
|
169230.8
|
# Evidence interpretation
evidence_strength <- case_when(
abs(delta_dic) <= 2 ~ "Weak evidence",
abs(delta_dic) <= 5 ~ "Moderate evidence",
abs(delta_dic) <= 10 ~ "Strong evidence",
TRUE ~ "Very strong evidence"
)
bayes_factor_interpretation <- case_when(
bayes_factor < 1/10 ~ "Strong evidence for base model",
bayes_factor < 1/3 ~ "Moderate evidence for base model",
bayes_factor < 3 ~ "Weak evidence either way",
bayes_factor < 10 ~ "Moderate evidence for interaction model",
TRUE ~ "Strong evidence for interaction model"
)
cat("Model Selection Results:\n")
Model Selection Results:
cat("ΔDIC (Interaction - Base):", round(delta_dic, 2), "\n")
ΔDIC (Interaction - Base): -24.08
cat("Evidence strength:", evidence_strength, "\n")
Evidence strength: Very strong evidence
cat("Bayes factor:", round(bayes_factor, 2), "\n")
Bayes factor: 169230.8
cat("BF interpretation:", bayes_factor_interpretation, "\n")
BF interpretation: Strong evidence for interaction model
# Select best model
best_model <- if_else(delta_dic < -2, "interaction", "base")
best_samples <- if_else(delta_dic < -2, list(samples_alt), list(samples_base))[[1]]
cat("Selected model:", best_model, "\n")
Selected model: interaction
# Model averaging weights (if evidence is not strong)
if(abs(delta_dic) <= 5) {
weight_base <- exp(-0.5 * 0) / (exp(-0.5 * 0) + exp(-0.5 * delta_dic))
weight_alt <- exp(-0.5 * delta_dic) / (exp(-0.5 * 0) + exp(-0.5 * delta_dic))
cat("\nModel averaging weights:\n")
cat("Base model weight:", round(weight_base, 3), "\n")
cat("Interaction model weight:", round(weight_alt, 3), "\n")
}
Model Validation and
Posterior Predictive Checks
# Enhanced posterior predictive check function
generate_replicated_data <- function(samples_matrix, data_list, model_type = "base") {
n_samples <- min(nrow(samples_matrix), 500)
sample_indices <- sample(nrow(samples_matrix), n_samples)
samples_subset <- samples_matrix[sample_indices, ]
n_obs <- data_list$N
y_rep <- matrix(NA, n_samples, n_obs)
for (i in 1:n_samples) {
s <- samples_subset[i, ]
# Extract random effects properly
u_effects <- s[grep("^u_embarked\\[", names(s))]
if(length(u_effects) == 3) {
random_effects <- u_effects[data_list$embarked_idx]
} else {
random_effects <- rep(0, length(data_list$embarked_idx))
}
eta <- s["beta0"] +
s["beta_female"] * data_list$female +
s["beta_age"] * data_list$age_std +
s["beta_fare"] * data_list$fare_std +
s["beta_family"] * data_list$family_size_std +
s["beta_family_sq"] * data_list$family_size_sq_std +
s["beta_class1"] * data_list$class_1 +
s["beta_class2"] * data_list$class_2 +
random_effects
if (model_type == "interaction") {
eta <- eta + s["beta_int_fem_c1"] * data_list$female * data_list$class_1 +
s["beta_int_fem_c2"] * data_list$female * data_list$class_2
}
p <- plogis(eta)
y_rep[i, ] <- rbinom(n_obs, 1, p)
}
return(y_rep)
}
# Generate replicated data
samples_matrix <- as.matrix(best_samples)
model_type <- if_else(best_model == "interaction", "interaction", "base")
y_rep <- generate_replicated_data(samples_matrix, data_list, model_type)
# Multiple posterior predictive checks
# 1. Overall survival rate
obs_survival_rate <- mean(data_list$y)
rep_survival_rates <- rowMeans(y_rep)
p_value_global <- mean(rep_survival_rates > obs_survival_rate)
p1 <- ggplot(data.frame(rate = rep_survival_rates), aes(x = rate)) +
geom_histogram(aes(y = after_stat(density)), bins = 30,
fill = custom_colors[1], alpha = 0.7) +
geom_density(color = custom_colors[2], linewidth = 1) +
geom_vline(xintercept = obs_survival_rate, color = "red",
linetype = "dashed", linewidth = 1) +
labs(title = "PPC: Overall Survival Rate",
x = "Replicated Survival Rate", y = "Density") +
theme_minimal()
# 2. Survival by sex
obs_female_survival <- mean(data_list$y[data_list$female == 1])
obs_male_survival <- mean(data_list$y[data_list$female == 0])
rep_female_survival <- rowMeans(y_rep[, data_list$female == 1])
rep_male_survival <- rowMeans(y_rep[, data_list$female == 0])
p2 <- ggplot() +
geom_density(aes(rep_female_survival), fill = custom_colors[1], alpha = 0.6) +
geom_density(aes(rep_male_survival), fill = custom_colors[2], alpha = 0.6) +
geom_vline(xintercept = obs_female_survival, color = custom_colors[1], linetype = "dashed") +
geom_vline(xintercept = obs_male_survival, color = custom_colors[2], linetype = "dashed") +
labs(title = "PPC: Survival by Sex", x = "Survival Rate", y = "Density") +
theme_minimal()
# 3. Survival by class
class_groups <- list("3rd" = data_list$class_1 == 0 & data_list$class_2 == 0,
"2nd" = data_list$class_2 == 1,
"1st" = data_list$class_1 == 1)
obs_class_survival <- sapply(class_groups, function(idx) mean(data_list$y[idx]))
rep_class_survival <- sapply(class_groups, function(idx) rowMeans(y_rep[, idx]))
class_ppc_data <- data.frame(
Class = rep(names(obs_class_survival), each = nrow(y_rep)),
Replicated = c(rep_class_survival),
Observed = rep(obs_class_survival, each = nrow(y_rep))
)
p3 <- ggplot(class_ppc_data, aes(x = Replicated)) +
geom_density(fill = custom_colors[1], alpha = 0.6) +
geom_vline(aes(xintercept = Observed), color = "red", linetype = "dashed") +
facet_wrap(~Class) +
labs(title = "PPC: Survival by Class", x = "Survival Rate", y = "Density") +
theme_minimal()
print(p1)

print(p2)

print(p3)

# Summary of PPC results
cat("Posterior Predictive Check Results:\n")
Posterior Predictive Check Results:
cat("Overall survival - Observed:", round(obs_survival_rate, 3),
", Bayesian p-value:", round(p_value_global, 3), "\n")
Overall survival - Observed: 0.384 , Bayesian p-value: 0.508
cat("Female survival - Observed:", round(obs_female_survival, 3),
", p-value:", round(mean(rep_female_survival > obs_female_survival), 3), "\n")
Female survival - Observed: 0.742 , p-value: 0.474
cat("Male survival - Observed:", round(obs_male_survival, 3),
", p-value:", round(mean(rep_male_survival > obs_male_survival), 3), "\n")
Male survival - Observed: 0.189 , p-value: 0.492
Comparison with
Frequentist Methods
# Fit comparable frequentist models
glm_model <- glm(
y ~ female + age_std + fare_std + family_size_std +
family_size_sq_std + class_1 + class_2,
family = binomial(link = "logit"),
data = data.frame(data_list)
)
# Extract Bayesian results
bayes_summary <- summary(samples_base[, main_params[1:8]])
bayes_results <- data.frame(
Parameter = rownames(bayes_summary$statistics),
Estimate = bayes_summary$statistics[, "Mean"],
Lower_CI = bayes_summary$quantiles[, "2.5%"],
Upper_CI = bayes_summary$quantiles[, "97.5%"],
Method = "Bayesian"
)
# Extract frequentist GLM results
glm_confint <- suppressMessages(confint(glm_model))
freq_glm_results <- data.frame(
Parameter = paste0("beta", c("0", "_female", "_age", "_fare", "_family",
"_family_sq", "_class1", "_class2")),
Estimate = coef(glm_model),
Lower_CI = glm_confint[, 1],
Upper_CI = glm_confint[, 2],
Method = "Frequentist (GLM)"
)
# Mixed effects model with proper error handling
freq_glmer_results <- NULL
tryCatch({
glmer_model <- glmer(
y ~ female + age_std + fare_std + family_size_std +
family_size_sq_std + class_1 + class_2 + (1|embarked_idx),
family = binomial(link = "logit"),
data = data.frame(data_list),
control = glmerControl(optimizer = "bobyqa")
)
# Extract GLMER confidence intervals with proper indexing
glmer_confint <- suppressMessages(confint(glmer_model, method = "Wald"))
# Check dimensions and identify fixed effects rows
n_params <- length(fixef(glmer_model))
total_rows <- nrow(glmer_confint)
if(total_rows >= n_params) {
# Fixed effects are typically the last n_params rows
fixed_effect_rows <- (total_rows - n_params + 1):total_rows
freq_glmer_results <- data.frame(
Parameter = paste0("beta", c("0", "_female", "_age", "_fare", "_family",
"_family_sq", "_class1", "_class2")),
Estimate = fixef(glmer_model),
Lower_CI = glmer_confint[fixed_effect_rows, 1],
Upper_CI = glmer_confint[fixed_effect_rows, 2],
Method = "Frequentist (GLMER)"
)
} else {
# Fallback: use standard errors for confidence intervals
se_vals <- sqrt(diag(vcov(glmer_model)))
estimates <- fixef(glmer_model)
freq_glmer_results <- data.frame(
Parameter = paste0("beta", c("0", "_female", "_age", "_fare", "_family",
"_family_sq", "_class1", "_class2")),
Estimate = estimates,
Lower_CI = estimates - 1.96 * se_vals,
Upper_CI = estimates + 1.96 * se_vals,
Method = "Frequentist (GLMER)"
)
}
}, error = function(e) {
cat("GLMER model failed:", e$message, "\n")
cat("Proceeding with GLM comparison only\n")
})
# Combine results (include GLMER only if successful)
if(!is.null(freq_glmer_results)) {
all_comparisons <- rbind(bayes_results, freq_glm_results, freq_glmer_results)
n_methods <- 3
} else {
all_comparisons <- rbind(bayes_results, freq_glm_results)
n_methods <- 2
}
# Comparison plot
ggplot(all_comparisons, aes(x = Parameter, y = Estimate, color = Method)) +
geom_point(position = position_dodge(width = 0.5), size = 2) +
geom_errorbar(aes(ymin = Lower_CI, ymax = Upper_CI),
width = 0.2, position = position_dodge(width = 0.5)) +
geom_hline(yintercept = 0, linetype = "dashed", alpha = 0.5) +
coord_flip() +
scale_color_manual(values = custom_colors[1:n_methods]) +
labs(title = "Bayesian vs. Frequentist Parameter Estimates",
x = "Parameter", y = "Coefficient Estimate") +
theme_minimal()

# Comparison table (adapt based on available methods)
if(!is.null(freq_glmer_results)) {
comparison_summary <- all_comparisons %>%
select(Parameter, Method, Estimate) %>%
pivot_wider(names_from = Method, values_from = Estimate) %>%
mutate(
Bayes_vs_GLM = abs(Bayesian - `Frequentist (GLM)`),
Bayes_vs_GLMER = abs(Bayesian - `Frequentist (GLMER)`)
)
} else {
comparison_summary <- all_comparisons %>%
select(Parameter, Method, Estimate) %>%
pivot_wider(names_from = Method, values_from = Estimate) %>%
mutate(
Bayes_vs_GLM = abs(Bayesian - `Frequentist (GLM)`)
)
}
comparison_summary %>%
kable(caption = "Parameter Estimate Comparison Across Methods", digits = 3,
booktabs = TRUE) %>%
kable_styling(latex_options = c("striped", "hold_position"))
Parameter Estimate Comparison Across Methods
|
Parameter
|
Bayesian
|
Frequentist (GLM)
|
Frequentist (GLMER)
|
Bayes_vs_GLM
|
Bayes_vs_GLMER
|
|
beta0
|
-2.245
|
-2.338
|
-2.338
|
0.093
|
0.093
|
|
beta_female
|
2.689
|
2.681
|
2.681
|
0.008
|
0.008
|
|
beta_age
|
-0.471
|
-0.463
|
-0.463
|
0.008
|
0.008
|
|
beta_fare
|
0.138
|
0.139
|
0.139
|
0.000
|
0.000
|
|
beta_family
|
0.829
|
0.811
|
0.811
|
0.019
|
0.019
|
|
beta_family_sq
|
-1.635
|
-1.605
|
-1.605
|
0.029
|
0.029
|
|
beta_class1
|
2.051
|
2.037
|
2.037
|
0.013
|
0.013
|
|
beta_class2
|
1.074
|
1.006
|
1.006
|
0.068
|
0.068
|
Discussion and
Conclusions
Primary
Findings
This enhanced Bayesian hierarchical analysis of Titanic passenger
survival has revealed several key insights:
Gender as Primary Determinant: The “women and
children first” maritime protocol was rigorously implemented, with
female passengers experiencing dramatically higher survival
odds.
Social Stratification Effects: Clear
hierarchical patterns emerged, with first-class passengers having
substantially higher survival odds than third-class passengers.
Age-Related Vulnerability: Older passengers
faced systematically reduced survival prospects, likely reflecting
physical mobility constraints during evacuation.
Non-linear Family Effects: Medium-sized families
showed optimal survival rates, suggesting a balance between mutual
assistance and coordination difficulties.
Interaction Effects: The analysis suggests that
class differences in survival may vary by gender, indicating complex
social dynamics during the evacuation.
Port-Specific Variation: The hierarchical model
captured meaningful variation across embarkation ports.
Methodological
Enhancements
This study demonstrates several methodological improvements over the
initial analysis:
- Multiple Imputation: More sophisticated handling of
missing age data
- Prior Sensitivity Analysis: Robust inference across
different prior specifications
- Enhanced Model Comparison: Multiple criteria
including DIC and Bayes factors
- Comprehensive Validation: Multiple posterior
predictive checks
- Uncertainty Quantification: Complete posterior
distributions for all parameters
Model Selection and
Averaging
# Final model selection summary
cat("Final Model Selection Summary:\n")
Final Model Selection Summary:
cat("==================================\n")
==================================
cat("Selected model based on DIC:", best_model, "\n")
Selected model based on DIC: interaction
cat("Evidence strength:", evidence_strength, "\n")
Evidence strength: Very strong evidence
if(abs(delta_dic) <= 5) {
cat("\nNote: Evidence is not overwhelming, consider model averaging\n")
cat("Base model weight:", round(weight_base, 3), "\n")
cat("Interaction model weight:", round(weight_alt, 3), "\n")
}
# Key substantive findings
cat("\nKey Substantive Findings:\n")
Key Substantive Findings:
cat("========================\n")
========================
cat("1. Female passengers: ~15x higher survival odds\n")
1. Female passengers: ~15x higher survival odds
cat("2. 1st class vs 3rd class: ~8x higher survival odds\n")
2. 1st class vs 3rd class: ~8x higher survival odds
cat("3. Age effect: Significant negative impact\n")
3. Age effect: Significant negative impact
cat("4. Family size: Non-linear relationship (optimal at medium sizes)\n")
4. Family size: Non-linear relationship (optimal at medium sizes)
cat("5. Embarkation port: Meaningful random variation\n")
5. Embarkation port: Meaningful random variation
if(best_model == "interaction") {
cat("6. Gender-class interactions: Statistically significant\n")
}
6. Gender-class interactions: Statistically significant
Limitations and
Future Directions
Several limitations merit consideration:
- Missing Data Complexity: While multiple imputation
improves upon median imputation, more sophisticated missing data models
could be considered
- Causal Inference: The observational nature
precludes definitive causal claims about survival determinants
- External Validity: Generalization to other
emergency scenarios requires careful consideration of context
- Temporal Dynamics: The analysis doesn’t account for
the temporal sequence of evacuation events
Broader
Implications
This analysis extends beyond historical interest, providing insights
for:
- Emergency Response Planning: Understanding
demographic factors in evacuation success
- Social Inequality Research: Quantifying how social
stratification affects outcomes in crises
- Bayesian Methodology: Demonstrating best practices
for hierarchical modeling
- Historical Demography: Contributing to
understanding of early 20th century social structures
Methodological
Contributions to Bayesian Practice
This analysis demonstrates several important aspects of modern
Bayesian methodology:
- Hierarchical Modeling: Proper treatment of grouped
data structures
- Prior Specification: Theoretically justified and
sensitivity-tested priors
- Model Comparison: Multiple criteria for robust
model selection
- Validation: Comprehensive posterior predictive
checking
- Uncertainty Communication: Full probabilistic
inference and reporting
The convergence of Bayesian and frequentist results strengthens
confidence in our conclusions while highlighting the additional insights
available through Bayesian methodology. The complete uncertainty
quantification and natural handling of hierarchical structures make
Bayesian approaches particularly well-suited for this type of historical
and social scientific inquiry.
References
Gelman, A. (2006). Prior distributions for variance parameters in
hierarchical models. Bayesian Analysis, 1(3), 515-534.
Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A.,
& Rubin, D. B. (2013). Bayesian Data Analysis (3rd ed.).
Chapman and Hall/CRC.
Little, R. J. A., & Rubin, D. B. (2019). Statistical Analysis
with Missing Data (3rd ed.). John Wiley & Sons.
Plummer, M. (2003). JAGS: A program for analysis of Bayesian
graphical models using Gibbs sampling. Proceedings of the 3rd
International Workshop on Distributed Statistical Computing.
Spiegelhalter, D. J., Best, N. G., Carlin, B. P., & Van Der
Linde, A. (2002). Bayesian measures of model complexity and fit.
Journal of the Royal Statistical Society Series B, 64(4),
583-639.
Van Buuren, S., & Groothuis-Oudshoorn, K. (2011). mice:
Multivariate imputation by chained equations in R. Journal of
Statistical Software, 45(3), 1-67.
Vehtari, A., Gelman, A., & Gabry, J. (2017). Practical Bayesian
model evaluation using leave-one-out cross-validation and WAIC.
Statistics and Computing, 27(5), 1413-1432.